24 research outputs found

    Inference in Switching Linear Dynamical Systems Applied to Noise Robust Speech Recognition of Isolated Digits

    Get PDF
    Real world applications such as hands-free dialling in cars may have to perform recognition of spoken digits in potentially very noisy environments. Existing state-of-the-art solutions to this problem use feature-based Hidden Markov Models~(HMMs), with a preprocessing stage to clean the noisy signal. However, the effect that the noise has on the induced HMM features is difficult to model exactly and limits the performance of the HMM system. An alternative to feature-based HMMs is to model the clean speech waveform directly, which has the potential advantage that including an explicit model of additive noise is straightforward. One of the most simple model of the clean speech waveform is the autoregressive~(AR) process. Being too simple to cope with the nonlinearity of the speech signal, the AR~process is generally embedded into a more elaborate model, such as the Switching Autoregressive HMM~(SAR-HMM). In this thesis, we extend the SAR-HMM to jointly model the clean speech waveform and additive Gaussian white noise. This is achieved by using a Switching Linear Dynamical System~(SLDS) whose internal dynamics is autoregressive. On an isolated digit recognition task where utterances have been corrupted by additive Gaussian white noise, the proposed~SLDS outperforms a state-of-the-art HMM system. For more natural noise sources, at low signal to noise ratios~(SNRs), it is also significantly more accurate than a feature-based HMM~system. Inferring the clean waveform from the observed noisy signal with a~SLDS is formally intractable, resulting in many approximation strategies in the literature. In this thesis, we present the Expectation Correction~(EC) approximation. The algorithm has excellent numerical performance compared to a wide range of competing techniques, and provides a stable and accurate linear-time approximation which scales well to long time series such as those found in acoustic modelling. A fundamental issue faced by models based on AR~processes is that they are sensitive to variations in the amplitude of the signal. One way to overcome this limitation is to use Gain Adaptation~(GA) to adjust the amplitude by maximising the likelihood of the observed signal. However, adjusting model parameters without constraint may lead to overfitting when the models are sufficiently flexible. In this thesis, we propose a statistically principled alternative based on an exact Bayesian procedure in which priors are explicitly defined on the parameters of the underlying AR~process. Compared to~GA, the Bayesian approach enhances recognition accuracy at high~SNRs, but is slightly less accurate at low~SNRs

    A Bayesian Switching Linear Dynamical System for Scale-Invariant robust speech extraction

    Get PDF
    Most state-of-the-art automatic speech recognition (ASR) systems deal with noise in the environment by extracting noise robust features which are subsequently modelled by a Hidden Markov Model (HMM). A limitation of this feature-based approach is that the influence of noise on the features is difficult to model explicitly and the HMM is typically over sensitive, dealing poorly with unexpected and severe noise environments. An alternative is to model the raw signal directly which has the potential advantage of allowing noise to be explicitly modelled. A popular way to model raw speech signals is to use an Autoregressive (AR) process. AR models are however very sensitive to variations in the amplitude of the signal. Our proposed Bayesian Autoregressive Switching Linear Dynamical System (BAR-SLDS) treats the observed noisy signal as a scaled, clean hidden signal plus noise. The variance of the noise and signal scaling factor are automatically adapted, enabling the robust identification of scale-invariant clean signals in the presence of noise

    Switching Linear Dynamical Systems for Noise Robust Speech Recognition

    Get PDF
    Real world applications such as hands-free speech recognition of isolated digits may have to deal with potentially very noisy environments. Existing state-of-the-art solutions to this problem use feature-based HMMs, with a preprocessing stage to clean the noisy signal. However, the effect that raw signal noise has on the induced HMM features is poorly understood, and limits the performance of the HMM system. An alternative to feature-based HMMs is to model the raw signal, which has the potential advantage that including an explicit noise model is straightforward. Here we jointly model the dynamics of both the raw speech signal and the noise, using a Switching Linear Dynamical System (SLDS). The new model was tested on isolated digit utterances corrupted by Gaussian noise. Contrary to the SAR-HMM, which provides a model of uncorrupted raw speech, the SLDS is comparatively noise robust and also significantly outperforms a state-of-the-art feature-based HMM. The computational complexity of the SLDS scales exponentially with the length of the time series. To counter this we use Expectation Correction which provides a stable and accurate linear-time approximation for this important class of models, aiding their further application in acoustic modelling

    A Bayesian Alternative to Gain Adaptation in Autoregressive Hidden Markov Models

    Get PDF
    Models dealing directly with the raw acoustic speech signal are an alternative to conventional feature-based HMMs. A popular way to model the raw speech signal is by means of an autoregressive (AR) process. Being too simple to cope with the nonlinearity of the speech signal, the AR process is generally embedded into a more elaborate model, such as the switching autoregressive HMM (SAR-HMM). A fundamental issue faced by models based on AR processes is that they are very sensitive to variations in the amplitude of the signal. One way to overcome this limitation is to use Gain Adaptation to adjust the amplitude by maximising the likelihood of the observed signal. However, adjusting model parameters by maximising test likelihoods is fundamentally outside the framework of standard statistical approaches to machine learning, since this may lead to overfitting when the models are sufficiently flexible. We propose a statistically principled alternative based on an exact Bayesian procedure in which priors are explicitly defined on the parameters of the AR process. Explicitly, we present the Bayesian SAR-HMM and compare the performance of this model against the standard Gain-Adapted SAR-HMM on a single digit recognition task, showing the effectiveness of the approach and suggesting thereby a principled and straightforward solution to the issue of Gain Adaptation

    Construction and comparison of approximations for switching linear gaussian state space models

    Get PDF
    We introduce a new method for approximate inference in Hybrid Dynamical Graphical models, in particular, for switching dynamical networks. For the important special case of switching linear Gaussian state space models (switching Kalman Filters), our method is a novel form of Gaussian sum smoother, consisting of a single forward and backward pass. Our method is particularly well suited to switching observation models, since one of the key approximations is obviated. We compare our method very favourably against a range of competing techniques, including sequential Monte Carlo and Expectation Propagation, for which we also derive a novel numerically more stable implementation using the `auxiliary variable trick'. We show that the use of mixture representations for both filtering and smoothing can dramatically improve the quality of the approximation

    A Spectrogram Model for Enhanced Source Localization and Noise-Robust ASR

    Get PDF
    This paper proposes a simple, computationally efficient 2-mixture model approach to discrimination between speech and background noise. It is directly derived from observations on real data, and can be used in a fully unsupervised manner, with the EM algorithm. A first application to sector-based, joint audio source localization and detection, using multiple microphones, confirms that the model can provide major enhancement. A second application to the single channel speech recognition task in a noisy environment yields major improvement on stationary noise and promising results on non-stationary noise

    A Frequency-Domain Silence Noise Model

    Get PDF
    This paper proposes a simple, computationally efficient 2-mixture model approach to discrimination between speech and background noise. It is directly derived from observations on real data, and can be used in a fully unsupervised manner, with the EM algorithm. A first application to sector-based, joint audio source localization and detection, using multiple microphones, confirms that the model can provide major enhancement. A second application to the single channel speech recognition task in a noisy environment yields major improvement on stationary noise and promising results on non-stationary noise

    Unsupervised Spectral Subtraction for Noise-Robust ASR

    Get PDF
    This paper proposes a simple, computationally efficient 2-mixture model approach to discriminate between speech and background noise at the magnitude spectrogram level. It is directly derived from observations on real data, and can be used in a fully unsupervised manner, with the EM algorithm. In this paper, the 2-mixture model is used in an ``Unsupervised Spectral Subtraction'' scheme that can be applied as a pre-processing step for any acoustic feature extraction scheme, such as MFCCs or PLP. The goal is to improve noise-robustness of the acoustic features. Experimental results on both OGI Numbers 95 and Aurora 2 tasks yielded a major improvement on all noise conditions, while retaining a similar performance on clean conditions

    Unsupervised Spectral Substraction for Noise-Robust ASR

    Get PDF
    This paper proposes a simple, computationally efficient \mbox{2-mixture} model approach to discriminate between speech and background noise at the magnitude spectrogram level. It is directly derived from observations on real data, and can be used in a fully unsupervised manner, with the EM algorithm. In this paper, the 2-mixture model is used in an ``Unsupervised Spectral Substraction'' scheme that can be applied as a pre-processing step for any acoustic feature extraction scheme, such as MFCCs or PLP. The goal is to improve noise-robustness of the acoustic features. Experimental results on both OGI~Numbers~95 and Aurora~2 tasks yielded a major improvement on all noise conditions, while retaining a similar performance on clean conditions
    corecore